PROBLEM STATEMENT:¶

Customer churn is when a company’s customers stop doing business with that company. Businesses are very keen on measuring churn because keeping an existing customer is far less expensive than acquiring a new customer. New business involves working leads through a sales funnel, using marketing and sales budgets to gain additional customers. Existing customers will often have a higher volume of service consumption and can generate additional customer referrals.

Customer retention can be achieved with good customer service and products. But the most effective way for a company to prevent attrition of customers is to truly know them. The vast volumes of data collected about customers can be used to build churn prediction models. Knowing who is most likely to defect means that a company can prioritise focused marketing efforts on that subset of their customer base.

Preventing customer churn is critically important to the telecommunications sector, as the barriers to entry for switching services are so low.

You will examine customer data from IBM Sample Data Sets with the aim of building and comparing several customer churn prediction models.¶

  • Importing require library for performing EDA, Data Wrangling and data cleaning
In [1]:
import pandas as pd # for data wrangling purpose
import numpy as np # Basic computation library
import seaborn as sns # For Visualization 
import matplotlib.pyplot as plt # ploting package
import warnings # Filtering warnings
warnings.filterwarnings('ignore')
In [2]:
# Importing Customer Churn Analysis dataset Csv file using pandas
In [3]:
df=pd.read_csv('Telecom_customer_churn.csv')
In [4]:
df.head()
Out[4]:
customerID gender SeniorCitizen Partner Dependents tenure PhoneService MultipleLines InternetService OnlineSecurity ... DeviceProtection TechSupport StreamingTV StreamingMovies Contract PaperlessBilling PaymentMethod MonthlyCharges TotalCharges Churn
0 7590-VHVEG Female 0 Yes No 1 No No phone service DSL No ... No No No No Month-to-month Yes Electronic check 29.85 29.85 No
1 5575-GNVDE Male 0 No No 34 Yes No DSL Yes ... Yes No No No One year No Mailed check 56.95 1889.5 No
2 3668-QPYBK Male 0 No No 2 Yes No DSL Yes ... No No No No Month-to-month Yes Mailed check 53.85 108.15 Yes
3 7795-CFOCW Male 0 No No 45 No No phone service DSL Yes ... Yes Yes No No One year No Bank transfer (automatic) 42.30 1840.75 No
4 9237-HQITU Female 0 No No 2 Yes No Fiber optic No ... No No No No Month-to-month Yes Electronic check 70.70 151.65 Yes

5 rows × 21 columns

In [5]:
# As we have 31 Columns Lets sort Columns by their datatype
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7043 entries, 0 to 7042
Data columns (total 21 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   customerID        7043 non-null   object 
 1   gender            7043 non-null   object 
 2   SeniorCitizen     7043 non-null   int64  
 3   Partner           7043 non-null   object 
 4   Dependents        7043 non-null   object 
 5   tenure            7043 non-null   int64  
 6   PhoneService      7043 non-null   object 
 7   MultipleLines     7043 non-null   object 
 8   InternetService   7043 non-null   object 
 9   OnlineSecurity    7043 non-null   object 
 10  OnlineBackup      7043 non-null   object 
 11  DeviceProtection  7043 non-null   object 
 12  TechSupport       7043 non-null   object 
 13  StreamingTV       7043 non-null   object 
 14  StreamingMovies   7043 non-null   object 
 15  Contract          7043 non-null   object 
 16  PaperlessBilling  7043 non-null   object 
 17  PaymentMethod     7043 non-null   object 
 18  MonthlyCharges    7043 non-null   float64
 19  TotalCharges      7043 non-null   object 
 20  Churn             7043 non-null   object 
dtypes: float64(1), int64(2), object(18)
memory usage: 1.1+ MB

Comment :¶

  • We have 7043 Rows and 21 Columns in this Telecom Dataset.
  • We have target Variable 'Churn' with object datatype, leading this to classification problem.
  • There is interesting entry here under object datatype which is 'TotalCharges'. This feature is numerical in nature but categories as Object datatypes. This implies that there is presence of string variable in this column or might be data error.
  • 'SeniorCitizen' is categorical variable listed as Numerical variable. So we gone convert it into object datatype.
  • At end we have 3 Numerical variable and 18 categorical variable. Out of which 'CustomerID' is unnecessary variable from our analytical & modelling viewpoint. We will drop 'CustomerID' column.

We are going to Group Variable into Numerical and Categorical variables list in order to simplify further analysis. Next thing is droping CustomerID Column.¶

In [6]:
df.drop(['customerID'],axis=1,inplace=True)

Statistical Analysis¶

Before Going for Statistical exploration of data, first check integrity of data & Missing value

Data Integrity Check¶

Since dataset is large, Let check for any entry which is repeated or duplicated in dataset.

In [7]:
df.duplicated().sum()  # This will check the duplicate data for all columns.
Out[7]:
22

We can see that 22 duplicate entry in dataset. Let drop duplicated entry from dataset.

In [8]:
df.drop_duplicates(keep='last',inplace= True)
In [9]:
df.shape
Out[9]:
(7021, 20)
In [10]:
df["TotalCharges"].unique()
Out[10]:
array(['29.85', '1889.5', '108.15', ..., '346.45', '306.6', '6844.5'],
      dtype=object)

Now check for any whitespaces, NA,'-' in dataset. We might find something in TotalCharges column by considering Object datatype .¶

In [11]:
df[df["TotalCharges"].isin([' ','NA','-'])==True]
Out[11]:
gender SeniorCitizen Partner Dependents tenure PhoneService MultipleLines InternetService OnlineSecurity OnlineBackup DeviceProtection TechSupport StreamingTV StreamingMovies Contract PaperlessBilling PaymentMethod MonthlyCharges TotalCharges Churn
488 Female 0 Yes Yes 0 No No phone service DSL Yes No Yes Yes Yes No Two year Yes Bank transfer (automatic) 52.55 No
753 Male 0 No Yes 0 Yes No No No internet service No internet service No internet service No internet service No internet service No internet service Two year No Mailed check 20.25 No
936 Female 0 Yes Yes 0 Yes No DSL Yes Yes Yes No Yes Yes Two year No Mailed check 80.85 No
1082 Male 0 Yes Yes 0 Yes Yes No No internet service No internet service No internet service No internet service No internet service No internet service Two year No Mailed check 25.75 No
1340 Female 0 Yes Yes 0 No No phone service DSL Yes Yes Yes Yes Yes No Two year No Credit card (automatic) 56.05 No
3331 Male 0 Yes Yes 0 Yes No No No internet service No internet service No internet service No internet service No internet service No internet service Two year No Mailed check 19.85 No
3826 Male 0 Yes Yes 0 Yes Yes No No internet service No internet service No internet service No internet service No internet service No internet service Two year No Mailed check 25.35 No
4380 Female 0 Yes Yes 0 Yes No No No internet service No internet service No internet service No internet service No internet service No internet service Two year No Mailed check 20.00 No
5218 Male 0 Yes Yes 0 Yes No No No internet service No internet service No internet service No internet service No internet service No internet service One year Yes Mailed check 19.70 No
6670 Female 0 Yes Yes 0 Yes Yes DSL No Yes Yes Yes Yes No Two year No Mailed check 73.35 No
6754 Male 0 No Yes 0 Yes Yes DSL Yes Yes No Yes No No Two year Yes Bank transfer (automatic) 61.90 No

There is possibility of whitespaces in TotalCharges column. lets deal with it.

In [12]:
# Replaceing Whitespaces with null values
df['TotalCharges']= df['TotalCharges'].replace(' ',np.nan)
In [13]:
# Converting object datatype into float
df['TotalCharges']= df['TotalCharges'].astype(float)
In [14]:
df['TotalCharges'].fillna(df['TotalCharges'].mean(),inplace=True)

We have remove whitespaces so let now check for missing values

We can impute missing value in TotalCharges either with mean and median. We can decide imputation method after checking distribution and Outliers in data¶

In [15]:
plt.figure(figsize=(13,5))
plt.subplot(1,2,1)
sns.boxplot(y='TotalCharges', data=df,color='cyan')
plt.ylabel('TotalCharges',fontsize=15)
plt.subplot(1,2,2)
sns.distplot(df['TotalCharges'], color='b')
plt.xlabel('TotalCharges',fontsize=15)
plt.tight_layout()
plt.show()
No description has been provided for this image
In [16]:
print("Mean of TotalCharges:",df['TotalCharges'].mean())
print("Median of TotalCharges:",df['TotalCharges'].median())
Mean of TotalCharges: 2290.3533880171185
Median of TotalCharges: 1410.25

Observation:¶

  • We can see that Outliers doesnot exist, so no mean sensitivity issue present here.
  • Distribution plot shows that Total Charges feature is right skewed.
  • Mean is greater than Median.

Considering above observation we can impute Missing value with Mean.

Imputation of Missing value in TotalCharges with Mean¶

In [17]:
df['TotalCharges']=df['TotalCharges'].fillna(df['TotalCharges'].mean())

Checking for Null values after Imputation¶

In [18]:
plt.figure(figsize=(9,6))
sns.heatmap(df.isnull(),cmap="cool_r")
plt.show()
No description has been provided for this image

Comment :¶

Finally, No Missing Value is Present.

We are Now Yes To Go Further !!!

Statistical Matrix¶

In [19]:
df.describe().T.style.background_gradient(subset=['mean','std','50%','count'], cmap='RdPu')
Out[19]:
  count mean std min 25% 50% 75% max
SeniorCitizen 7021.000000 0.162512 0.368947 0.000000 0.000000 0.000000 0.000000 1.000000
tenure 7021.000000 32.469449 24.534965 0.000000 9.000000 29.000000 55.000000 72.000000
MonthlyCharges 7021.000000 64.851894 30.069001 18.250000 35.750000 70.400000 89.900000 118.750000
TotalCharges 7021.000000 2290.353388 2265.044136 18.800000 411.150000 1410.250000 3801.700000 8684.800000
In [20]:
# df[Categorical].describe().T

The best way to avoid customer churn is to know your customers, and the best way to know your customer is through historical and new customer data.¶

Start with Enlisting Value counts & Sub-categories of different categorial features available¶

In [21]:
Numerical=df.select_dtypes(exclude="object")
Categorical=df.select_dtypes(include="object")
In [22]:
for i in Categorical:
    print(df[i].value_counts())
    print("="*100)
gender
Male      3541
Female    3480
Name: count, dtype: int64
====================================================================================================
Partner
No     3619
Yes    3402
Name: count, dtype: int64
====================================================================================================
Dependents
No     4911
Yes    2110
Name: count, dtype: int64
====================================================================================================
PhoneService
Yes    6339
No      682
Name: count, dtype: int64
====================================================================================================
MultipleLines
No                  3368
Yes                 2971
No phone service     682
Name: count, dtype: int64
====================================================================================================
InternetService
Fiber optic    3090
DSL            2419
No             1512
Name: count, dtype: int64
====================================================================================================
OnlineSecurity
No                     3490
Yes                    2019
No internet service    1512
Name: count, dtype: int64
====================================================================================================
OnlineBackup
No                     3080
Yes                    2429
No internet service    1512
Name: count, dtype: int64
====================================================================================================
DeviceProtection
No                     3087
Yes                    2422
No internet service    1512
Name: count, dtype: int64
====================================================================================================
TechSupport
No                     3465
Yes                    2044
No internet service    1512
Name: count, dtype: int64
====================================================================================================
StreamingTV
No                     2802
Yes                    2707
No internet service    1512
Name: count, dtype: int64
====================================================================================================
StreamingMovies
No                     2777
Yes                    2732
No internet service    1512
Name: count, dtype: int64
====================================================================================================
Contract
Month-to-month    3853
Two year          1695
One year          1473
Name: count, dtype: int64
====================================================================================================
PaperlessBilling
Yes    4161
No     2860
Name: count, dtype: int64
====================================================================================================
PaymentMethod
Electronic check             2359
Mailed check                 1596
Bank transfer (automatic)    1544
Credit card (automatic)      1522
Name: count, dtype: int64
====================================================================================================
Churn
No     5164
Yes    1857
Name: count, dtype: int64
====================================================================================================
In [23]:
sns.set_palette('hsv')
plt.figure(figsize=(20,40), facecolor='white')
plotnumber =1
for i in Categorical:
    if plotnumber <=16:
        ax = plt.subplot(4,4,plotnumber)
        sns.countplot(x=df[i])
        plt.xlabel(i,fontsize=20)
    plotnumber+=1
plt.show()
No description has been provided for this image

Now Start exploreing feature one by one, begin with Target Feature

Target Variable Churn¶

In [24]:
sns.set_palette('husl')
f,ax=plt.subplots(1,2,figsize=(15,8))
df['Churn'].value_counts().plot.pie(explode=[0,0.1],autopct='%3.1f%%', 
                                    ax=ax[0],shadow=True)
ax[0].set_title('Churn Distribution', fontsize=22,fontweight ='bold')
ax[0].set_ylabel('')
sns.countplot(x='Churn',data=df,ax=ax[1])
ax[1].set_title('Churn Distribution',fontsize=22,fontweight ='bold')
ax[1].set_xlabel("Churn",fontsize=18,fontweight ='bold')
plt.xticks(fontsize=18,fontweight ='bold')
plt.show()
No description has been provided for this image

Comment :¶

  • 26.4 % Customer choose to churn service in last month. Which is quite high number.This all leads to imbalanced data case as churn is our target variable.

Let start exploration of Independent feature to figure where customer are unstatisfied and what are customers need or inclination in cutting edge competition.

Gender vs Churn : Can there exist any trend between gender & churn or any impact of gender on Churn?¶

In [25]:
sns.set_palette('husl')
fig,ax=plt.subplots(1,2,figsize=(16,8))
df['gender'].value_counts().plot.pie(explode=[0,0.1],autopct='%2.1f%%',
                ax=ax[0],shadow=True)
ax[0].set_title('Gender', fontsize=20,fontweight ='bold')
ax[0].set_ylabel('')
sns.countplot(x='gender',hue="Churn",data=df,ax=ax[1])
ax[1].set_title('Gender-wise Churn Tendency',fontsize=20,fontweight ='bold')
ax[1].set_xlabel("Churn ",fontsize=18,fontweight ='bold')
plt.xticks(fontsize=14,fontweight ='bold')
plt.tight_layout()
plt.show()
No description has been provided for this image
In [26]:
pd.crosstab(df['gender'],df["Churn"],margins=True).style.background_gradient(cmap='summer_r')
Out[26]:
Churn No Yes All
gender      
Female 2546 934 3480
Male 2618 923 3541
All 5164 1857 7021
In [27]:
plt.figure(figsize=(6, 6))
labels =["Churn: Yes","Churn:No"]
values = [1869,5163]
labels_gender = ["F","M","F","M"]
sizes_gender = [939,930 , 2544,2619]
colors = ['#ff6666', '#66b3ff']
colors_gender = ['#c2c2f0','#ffb3e6', '#c2c2f0','#ffb3e6']
explode = (0.3,0.3) 
explode_gender = (0.1,0.1,0.1,0.1)
textprops = {"fontsize":15}
#Plot
plt.pie(values, labels=labels,autopct='%1.1f%%',pctdistance=1.08, labeldistance=0.8,colors=colors, startangle=90,frame=True, explode=explode,radius=10, textprops =textprops, counterclock = True, )
plt.pie(sizes_gender,labels=labels_gender,colors=colors_gender,startangle=90, explode=explode_gender,radius=7, textprops =textprops, counterclock = True, )
#Draw circle
centre_circle = plt.Circle((0,0),5,color='black', fc='white',linewidth=0)
fig = plt.gcf()
fig.gca().add_artist(centre_circle)

plt.title('Churn Distribution w.r.t Gender: Male(M), Female(F)', fontsize=15, y=1.1)

# show plot 
 
plt.axis('equal')
plt.tight_layout()
plt.show()
No description has been provided for this image

Comment :¶

  • Data contain both gender almost in same proportion with minor difference.
  • Both gender have tendency of attrition in same percentage.

Next Investigate Senior Citizen vs Gender Wise Churn Tendency¶

Lets see how many of them are Senior Citizen and Churn tendency in senior citizen

In [28]:
sns.set_palette('husl')
f,ax=plt.subplots(1,2,figsize=(16,8))
df['SeniorCitizen'].value_counts().plot.pie(explode=[0,0.1],autopct='%2.1f%%',
                                          ax=ax[0],shadow=True)
ax[0].set_title('Senior Citizen Distribution', fontsize=20,fontweight ='bold')
ax[0].set_ylabel('')
sns.countplot(x='SeniorCitizen',hue="Churn",data=df,ax=ax[1])
ax[1].set_title('Senior Citizen-wise Churn Tendency',fontsize=20,fontweight ='bold')
ax[1].set_xlabel("Churn ",fontsize=18,fontweight ='bold')
plt.xticks(fontsize=14,fontweight ='bold')
plt.tight_layout()
plt.show()
No description has been provided for this image

There are only 16.3 % of the customers who are senior citizens. Thus most of our customers in the data are younger people.

In [29]:
pd.crosstab([df.gender,df.SeniorCitizen],df["Churn"],margins=True).style.background_gradient(cmap='summer_r')
Out[29]:
  Churn No Yes All
gender SeniorCitizen      
Female 0 2218 695 2913
1 328 239 567
Male 0 2280 687 2967
1 338 236 574
All 5164 1857 7021
In [30]:
# Comparing tenure and SeniorCitizen
plt.title("Comparison between tenure and SeniorCitizen")
sns.stripplot(x = "SeniorCitizen",y="tenure",data = df)
plt.show()
No description has been provided for this image

Around 16% customer are Senior citizen and form countplot we can see they have more tendency to churn.

There is no significant relation between Senior Citizen and Tenure.

Effect of Partner and Dependents on Churn¶

In [31]:
sns.set_palette('Set1')
f,ax=plt.subplots(1,2,figsize=(16,8))
df['Partner'].value_counts().plot.pie(explode=[0,0.1],autopct='%2.1f%%',
                                          ax=ax[0],shadow=True)
ax[0].set_title('Partner Distribution', fontsize=20,fontweight ='bold')
ax[0].set_ylabel('')
sns.countplot(x='Partner',hue="Churn",data=df,ax=ax[1])
ax[1].set_title('Effect of Partner on Churn Tendency',fontsize=20,fontweight ='bold')
ax[1].set_xlabel("Partner ",fontsize=18,fontweight ='bold')
plt.xticks(fontsize=14,fontweight ='bold')
plt.tight_layout()
plt.show()
No description has been provided for this image
In [32]:
sns.set_palette('rainbow')
f,ax=plt.subplots(1,2,figsize=(16,8))
df['Dependents'].value_counts().plot.pie(explode=[0,0.1],autopct='%2.1f%%',
                                          ax=ax[0],shadow=True)
ax[0].set_title('Dependents Distribution', fontsize=20,fontweight ='bold')
ax[0].set_ylabel('')
sns.countplot(x='Dependents',hue="Churn",data=df,ax=ax[1])
ax[1].set_title('Effect of Dependents on Churn Tendency',fontsize=20,fontweight ='bold')
ax[1].set_xlabel("Dependents ",fontsize=18,fontweight ='bold')
plt.xticks(fontsize=14,fontweight ='bold')
plt.tight_layout()
plt.show()
No description has been provided for this image

Observation-¶

  • Customer having Partner have less tendency to Churn.
  • Almost 30% Customer have dependents on them and they also have less tendency to churn compare to remaining 70%
In [33]:
#plt.rcParams["figure.autolayout"] = True
sns.set_palette('coolwarm')
f,ax=plt.subplots(1,2,figsize=(16,8))
df['StreamingTV'].value_counts().plot.pie(explode=[0.03,0.03,0.03],autopct='%2.1f%%',
                                          ax=ax[0],shadow=True)
ax[0].set_title('StreamingTV Distribution', fontsize=20,fontweight ='bold')
ax[0].set_ylabel('')
sns.countplot(x='StreamingTV',hue="Churn",data=df,ax=ax[1])
ax[1].set_title('Effect of StreamingTV on Churn Tendency',fontsize=20,fontweight ='bold')
ax[1].set_xlabel("StreamingTV ",fontsize=18,fontweight ='bold')
plt.xticks(fontsize=14,fontweight ='bold')
plt.tight_layout()
plt.show()
No description has been provided for this image
In [34]:
#plt.rcParams["figure.autolayout"] = True
sns.set_palette('tab10')
f,ax=plt.subplots(1,2,figsize=(16,8))
df['InternetService'].value_counts().plot.pie(explode=[0.03,0.03,0.03],autopct='%2.1f%%',
                                          ax=ax[0],shadow=True)
ax[0].set_title('InternetService Distribution', fontsize=20,fontweight ='bold')
ax[0].set_ylabel('')
sns.countplot(x='InternetService',hue="Churn",data=df,ax=ax[1])
ax[1].set_title('Effect of InternetService on Churn Tendency',fontsize=20,fontweight ='bold')
ax[1].set_xlabel("InternetService ",fontsize=18,fontweight ='bold')
plt.xticks(fontsize=14,fontweight ='bold')
plt.tight_layout()
plt.show()
No description has been provided for this image
In [35]:
plt.figure(figsize=(8,5))
sns.scatterplot(x="InternetService", y='MonthlyCharges',data=df,hue="Churn")
plt.show()
No description has been provided for this image

44% Customer perfer Fibre optic as Interent service and surpringly we can find high churn rate among them.

We can find high monthly charges among customer using fiber optic compare to DSL. We can conclude that High charges is reason of customer churn.

In [36]:
plt.rcParams["figure.autolayout"] = True
sns.set_palette('rainbow_r')

f, ax = plt.subplots(1, 2, figsize=(16, 8))

df['StreamingMovies'].value_counts().plot.pie(
    explode=[0.03, 0.03, 0.03],
    autopct='%2.1f%%',
    ax=ax[0],
    shadow=True
)

ax[0].set_title('StreamingMovies Distribution', fontsize=20, fontweight='bold')
ax[0].set_ylabel('')


sns.countplot(x='StreamingMovies', hue='Churn', data=df, ax=ax[1])

ax[1].set_title('Effect of StreamingMovies on Churn Tendency', fontsize=20, fontweight='bold')
ax[1].set_xlabel("StreamingMovies", fontsize=18, fontweight='bold')

plt.xticks(fontsize=14, fontweight='bold')
plt.tight_layout()
plt.show()
No description has been provided for this image

Almost same churn tendency in people streaming movies and not.

In [37]:
#plt.rcParams["figure.autolayout"] = True
sns.set_palette('husl')
f, ax = plt.subplots(1, 2, figsize=(16, 8))
df['Contract'].value_counts().plot.pie(
    explode=[0.03, 0.03, 0.03],
    autopct='%2.1f%%',
    ax=ax[0],
    shadow=True
)
ax[0].set_title('Contract Distribution', fontsize=20, fontweight='bold')
ax[0].set_ylabel('')

sns.countplot(x='Contract', hue='Churn', data=df, ax=ax[1])

ax[1].set_title('Effect of Contract on Churn Tendency', fontsize=20, fontweight='bold')
ax[1].set_xlabel("Contract", fontsize=18, fontweight='bold')

plt.xticks(fontsize=14, fontweight='bold')
plt.tight_layout()
plt.show()
No description has been provided for this image
In [38]:
plt.figure(figsize=(8,5))
sns.scatterplot(x="Contract", y='MonthlyCharges',data=df,hue="Churn")
plt.show()
No description has been provided for this image

Almost 55% customer perfer month to month contract compare to other.We also find high churn rate in these customer.

We did not find any relation between Monthly charges and contract tenure.

In [39]:
plt.rcParams["figure.autolayout"] = True
sns.set_palette('gist_earth')
f,ax=plt.subplots(1,2,figsize=(16,8))
df['PaperlessBilling'].value_counts().plot.pie(explode=[0.03,0.03],autopct='%2.1f%%',
                                          ax=ax[0],shadow=True)
ax[0].set_title('PaperlessBilling Distribution', fontsize=20,fontweight ='bold')
ax[0].set_ylabel('')
sns.countplot(x = 'PaperlessBilling',hue="Churn",data=df,ax=ax[1])
ax[1].set_title('Effect of PaperlessBilling on Churn Tendency',fontsize=20,fontweight ='bold')
ax[1].set_xlabel("PaperlessBilling ",fontsize=18,fontweight ='bold')
plt.xticks(fontsize=14,fontweight ='bold')
plt.tight_layout()
plt.show()
No description has been provided for this image

60% Customer perfer paperless billing.

The customers who prefer paperless billing they have high churn rate.

In [40]:
#plt.rcParams["figure.autolayout"] = True
sns.set_palette('coolwarm')
f,ax=plt.subplots(1,2,figsize=(16,8))
df['PaymentMethod'].value_counts().plot.pie(explode=[0.03,0.03,0.03,0.03],autopct='%2.1f%%',
                                          ax=ax[0],shadow=True)
ax[0].set_title('Payment Method Distribution', fontsize=20,fontweight ='bold')
ax[0].set_ylabel('')
sns.countplot(x = 'PaymentMethod',hue="Churn",data=df,ax=ax[1])
ax[1].set_title('Effect of PaymentMethod on Churn Tendency',fontsize=20,fontweight ='bold')
ax[1].set_xlabel("Payment Method ",fontsize=18,fontweight ='bold')
plt.xticks(fontsize=12,rotation=15)
plt.tight_layout()
plt.show()
No description has been provided for this image

We can see high Attrition tendency in customer who pay by Electronic check.

In [41]:
sns.set_palette('tab20_r')
fig , ax=plt.subplots(2,2, figsize=(15,10))
for i,col in enumerate(["MonthlyCharges","TotalCharges"]):
    sns.scatterplot(ax=ax[0,i],x="tenure", y=col,data=df,hue="Churn")
    sns.lineplot(ax=ax[1,i],x="tenure", y=col,data=df,hue="Churn")
No description has been provided for this image

Observation:¶

  • High Monthly Charges in customer who choose churn compare to rest.
  • Same goes with High Total Charges in customer who choose churn compare to rest.
In [42]:
sns.pairplot(df,hue="Churn",palette="Dark2")
plt.show()
No description has been provided for this image

Encoding categorical data¶

In [43]:
df.columns.to_series().groupby(df.dtypes).groups
Out[43]:
{int64: ['SeniorCitizen', 'tenure'], float64: ['MonthlyCharges', 'TotalCharges'], object: ['gender', 'Partner', 'Dependents', 'PhoneService', 'MultipleLines', 'InternetService', 'OnlineSecurity', 'OnlineBackup', 'DeviceProtection', 'TechSupport', 'StreamingTV', 'StreamingMovies', 'Contract', 'PaperlessBilling', 'PaymentMethod', 'Churn']}
In [44]:
df.head()
Out[44]:
gender SeniorCitizen Partner Dependents tenure PhoneService MultipleLines InternetService OnlineSecurity OnlineBackup DeviceProtection TechSupport StreamingTV StreamingMovies Contract PaperlessBilling PaymentMethod MonthlyCharges TotalCharges Churn
0 Female 0 Yes No 1 No No phone service DSL No Yes No No No No Month-to-month Yes Electronic check 29.85 29.85 No
1 Male 0 No No 34 Yes No DSL Yes No Yes No No No One year No Mailed check 56.95 1889.50 No
2 Male 0 No No 2 Yes No DSL Yes Yes No No No No Month-to-month Yes Mailed check 53.85 108.15 Yes
3 Male 0 No No 45 No No phone service DSL Yes No Yes Yes No No One year No Bank transfer (automatic) 42.30 1840.75 No
4 Female 0 No No 2 Yes No Fiber optic No No No No No No Month-to-month Yes Electronic check 70.70 151.65 Yes
In [45]:
Numerical =['tenure','MonthlyCharges', 'TotalCharges']
In [46]:
# df["--"] = df["--"].astype("object")
In [47]:
Category =['gender', 'Partner','PhoneService', 'Dependents', 'MultipleLines', 'InternetService', 'OnlineSecurity', 
           'OnlineBackup', 'DeviceProtection', 'TechSupport','StreamingTV', 'StreamingMovies', 
           'Contract', 'PaperlessBilling', 'PaymentMethod', 'Churn']
In [ ]:
 
In [48]:
# Using Label Encoder on categorical variable
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
Categorical=df.select_dtypes(include="object")
for i in Categorical:
    df[i] = le.fit_transform(df[i])
df.head()
Out[48]:
gender SeniorCitizen Partner Dependents tenure PhoneService MultipleLines InternetService OnlineSecurity OnlineBackup DeviceProtection TechSupport StreamingTV StreamingMovies Contract PaperlessBilling PaymentMethod MonthlyCharges TotalCharges Churn
0 0 0 1 0 1 0 1 0 0 2 0 0 0 0 0 1 2 29.85 29.85 0
1 1 0 0 0 34 1 0 0 2 0 2 0 0 0 1 0 3 56.95 1889.50 0
2 1 0 0 0 2 1 0 0 2 2 0 0 0 0 0 1 3 53.85 108.15 1
3 1 0 0 0 45 0 1 0 2 0 2 2 0 0 1 0 0 42.30 1840.75 0
4 0 0 0 0 2 1 0 1 0 0 0 0 0 0 0 1 2 70.70 151.65 1

Feature selection and Engineering¶

1. Outliers Detection and Removal¶

In [49]:
plt.figure(figsize=(18,10),facecolor='white')
plotnumber=1

for column in Numerical:
    if plotnumber<=4:
        ax=plt.subplot(2,2,plotnumber)
        sns.boxplot(df[column],color='g')
        plt.xlabel(column,fontsize=20)
    plotnumber+=1
plt.show()
No description has been provided for this image

From Boxplot we can see No outliers exist dataset.

Outliers removal using Zscore method¶

In [50]:
from scipy.stats import zscore
z = np.abs(zscore(df))
df1 = df[(z<3).all(axis = 1)]

print ("Shape of the dataframe before removing outliers: ", df.shape)
print ("Shape of the dataframe after removing outliers: ", df1.shape)
print ("Percentage of data loss post outlier removal: ", (df.shape[0]-df1.shape[0])/df.shape[0]*100)

df=df1.copy() # reassigning the changed dataframe name to our original dataframe name
Shape of the dataframe before removing outliers:  (7021, 20)
Shape of the dataframe after removing outliers:  (6339, 20)
Percentage of data loss post outlier removal:  9.713715994872524
In [51]:
df['PhoneService'].unique()
Out[51]:
array([1])
In [52]:
df.drop(['PhoneService'],axis=1,inplace=True)
In [53]:
df.skew() #-0.5 to 0.5---
Out[53]:
gender             -0.012939
SeniorCitizen       1.819335
Partner             0.049562
Dependents          0.871194
tenure              0.233517
MultipleLines       0.125532
InternetService     0.051965
OnlineSecurity      0.421216
OnlineBackup        0.166121
DeviceProtection    0.181524
TechSupport         0.408970
StreamingTV        -0.005185
StreamingMovies    -0.012505
Contract            0.624212
PaperlessBilling   -0.388673
PaymentMethod      -0.165613
MonthlyCharges     -0.404120
TotalCharges        0.895850
Churn               1.058644
dtype: float64
In [54]:
num=["tenure","MonthlyCharges","TotalCharges"]

2. Skewness of features¶

In [55]:
plt.figure(figsize=(20,5),facecolor='white')
sns.set_palette('plasma')
plotnum=1
for col in num:
    if plotnum<=4:
        plt.subplot(2,2,plotnum)
        sns.distplot(df[col])
        plt.xlabel(col,fontsize=20)
    plotnum+=1
plt.show()
No description has been provided for this image

Skewness is important feature for continous data.

There is no relevence of skweness for discrete numerical feature like month and categorical feature.So we gone ignore skewness present in discrete numerical and categorical feature.

In [56]:
df.skew()  #-0.5 se 0.5
Out[56]:
gender             -0.012939
SeniorCitizen       1.819335
Partner             0.049562
Dependents          0.871194
tenure              0.233517
MultipleLines       0.125532
InternetService     0.051965
OnlineSecurity      0.421216
OnlineBackup        0.166121
DeviceProtection    0.181524
TechSupport         0.408970
StreamingTV        -0.005185
StreamingMovies    -0.012505
Contract            0.624212
PaperlessBilling   -0.388673
PaymentMethod      -0.165613
MonthlyCharges     -0.404120
TotalCharges        0.895850
Churn               1.058644
dtype: float64

'tenure','MonthlyCharges', 'TotalCharges' are continous numerical feature in dataset.

Out of which TotalCharges is skewed in nature. Which we gone transform here.

In [57]:
df['TotalCharges'] = np.log1p(df['TotalCharges'])

3. Corrleation¶

In [58]:
df.corr()
Out[58]:
gender SeniorCitizen Partner Dependents tenure MultipleLines InternetService OnlineSecurity OnlineBackup DeviceProtection TechSupport StreamingTV StreamingMovies Contract PaperlessBilling PaymentMethod MonthlyCharges TotalCharges Churn
gender 1.000000 -0.005846 -0.002207 0.015722 0.001891 -0.006391 0.000983 -0.016826 -0.009353 -0.003121 -0.009769 -0.005624 -0.002760 0.000674 -0.018131 0.021961 -0.011639 -0.006783 -0.011391
SeniorCitizen -0.005846 1.000000 0.013943 -0.213486 0.017647 0.152954 -0.039479 -0.123668 -0.020710 -0.023590 -0.144694 0.028453 0.047062 -0.141107 0.155193 -0.041891 0.238426 0.111597 0.149599
Partner -0.002207 0.013943 1.000000 0.453409 0.382932 0.147545 -0.004099 0.151348 0.154738 0.167390 0.132266 0.133353 0.127676 0.297393 -0.010458 -0.147854 0.088571 0.337926 -0.153262
Dependents 0.015722 -0.213486 0.453409 1.000000 0.159194 -0.028535 0.053701 0.146427 0.090389 0.082944 0.130166 0.048859 0.023932 0.242286 -0.106970 -0.037411 -0.131791 0.084275 -0.158628
tenure 0.001891 0.017647 0.382932 0.159194 1.000000 0.358098 -0.034932 0.326356 0.377187 0.367678 0.324457 0.282710 0.292966 0.674586 0.002370 -0.361878 0.242184 0.827354 -0.348882
MultipleLines -0.006391 0.152954 0.147545 -0.028535 0.358098 1.000000 -0.107675 0.006752 0.125043 0.130055 0.011287 0.187307 0.193380 0.114261 0.174017 -0.183244 0.454819 0.458583 0.042438
InternetService 0.000983 -0.039479 -0.004099 0.053701 -0.034932 -0.107675 1.000000 -0.027406 0.030417 0.049829 -0.022841 0.099513 0.094169 0.115528 -0.164085 0.096674 -0.470605 -0.260767 -0.058968
OnlineSecurity -0.016826 -0.123668 0.151348 0.146427 0.326356 0.006752 -0.027406 1.000000 0.198167 0.173275 0.283252 0.046717 0.062345 0.367667 -0.154346 -0.089597 -0.071808 0.207795 -0.289182
OnlineBackup -0.009353 -0.020710 0.154738 0.090389 0.377187 0.125043 0.030417 0.198167 1.000000 0.195604 0.210090 0.151646 0.139587 0.286126 -0.019141 -0.126394 0.110079 0.310079 -0.201206
DeviceProtection -0.003121 -0.023590 0.167390 0.082944 0.367678 0.130055 0.049829 0.173275 0.195604 1.000000 0.241956 0.278088 0.284397 0.342751 -0.040732 -0.132907 0.154859 0.318027 -0.176171
TechSupport -0.009769 -0.144694 0.132266 0.130166 0.324457 0.011287 -0.022841 0.283252 0.210090 0.241956 1.000000 0.174169 0.179502 0.417344 -0.107286 -0.104360 -0.022495 0.225559 -0.279455
StreamingTV -0.005624 0.028453 0.133353 0.048859 0.282710 0.187307 0.099513 0.046717 0.151646 0.278088 0.174169 1.000000 0.429865 0.226185 0.092405 -0.094395 0.326687 0.315502 -0.033758
StreamingMovies -0.002760 0.047062 0.127676 0.023932 0.292966 0.193380 0.094169 0.062345 0.139587 0.284397 0.179502 0.429865 1.000000 0.232550 0.071674 -0.100094 0.328388 0.326481 -0.039644
Contract 0.000674 -0.141107 0.297393 0.242286 0.674586 0.114261 0.115528 0.367667 0.286126 0.342751 0.417344 0.226185 0.232550 1.000000 -0.179155 -0.222411 -0.106646 0.427736 -0.396873
PaperlessBilling -0.018131 0.155193 -0.010458 -0.106970 0.002370 0.174017 -0.164085 -0.154346 -0.019141 -0.040732 -0.107286 0.092405 0.071674 -0.179155 1.000000 -0.062858 0.377321 0.151738 0.195364
PaymentMethod 0.021961 -0.041891 -0.147854 -0.037411 -0.361878 -0.183244 0.096674 -0.089597 -0.126394 -0.132907 -0.104360 -0.094395 -0.100094 -0.222411 -0.062858 1.000000 -0.195322 -0.363128 0.103054
MonthlyCharges -0.011639 0.238426 0.088571 -0.131791 0.242184 0.454819 -0.470605 -0.071808 0.110079 0.154859 -0.022495 0.326687 0.328388 -0.106646 0.377321 -0.195322 1.000000 0.579093 0.218422
TotalCharges -0.006783 0.111597 0.337926 0.084275 0.827354 0.458583 -0.260767 0.207795 0.310079 0.318027 0.225559 0.315502 0.326481 0.427736 0.151738 -0.363128 0.579093 1.000000 -0.223677
Churn -0.011391 0.149599 -0.153262 -0.158628 -0.348882 0.042438 -0.058968 -0.289182 -0.201206 -0.176171 -0.279455 -0.033758 -0.039644 -0.396873 0.195364 0.103054 0.218422 -0.223677 1.000000
In [59]:
plt.figure(figsize=(25,15))
sns.heatmap(df.corr(), vmin=-1, vmax=1, annot=True, square=True, fmt='0.3f', 
            annot_kws={'size':10}, cmap="gist_stern")
plt.xticks(fontsize=12)
plt.yticks(fontsize=12)
plt.show()
No description has been provided for this image
In [60]:
plt.figure(figsize = (18,6))
df.corr()['Churn'].drop(['Churn']).sort_values(ascending=False).plot(kind='bar',color = 'purple')
plt.xlabel('Features',fontsize=15,fontweight='bold')
plt.ylabel('Churn',fontsize=15,fontweight='bold')
plt.title('Correlation of features with Target Variable Churn',fontsize = 20,fontweight='bold')
plt.show()
No description has been provided for this image

4. Balanceing Imbalanced target feature¶

In [61]:
df.Churn.value_counts()
Out[61]:
Churn
0    4652
1    1687
Name: count, dtype: int64

As Target variable data is Imbalanced in nature we will need to balance target variable.

Balancing using SMOTE¶

In [62]:
from imblearn.over_sampling import SMOTE
In [63]:
# Splitting data in target and dependent feature
X = df.drop(['Churn'], axis =1)
Y = df['Churn']
In [64]:
# Oversampleing using SMOTE Techniques
oversample = SMOTE()
X, Y = oversample.fit_resample(X, Y)
In [65]:
Y.value_counts()
Out[65]:
Churn
0    4652
1    4652
Name: count, dtype: int64

We have successfully resolved the class imbalanced problem and now all the categories have same data ensuring that the ML model does not get biased towards one category.

Standard Scaling¶

In [66]:
from sklearn.preprocessing import StandardScaler
scaler= StandardScaler()
X_scale = scaler.fit_transform(X)

5. Checking Multicollinearity between features using variance_inflation_factor¶

In [67]:
from statsmodels.stats.outliers_influence import variance_inflation_factor
vif = pd.DataFrame()
vif["VIF values"] = [variance_inflation_factor(X_scale,i) for i in range(len(X.columns))]
vif["Features"] = X.columns
vif
Out[67]:
VIF values Features
0 1.014034 gender
1 1.097927 SeniorCitizen
2 1.541789 Partner
3 1.428879 Dependents
4 6.495017 tenure
5 1.428823 MultipleLines
6 1.470469 InternetService
7 1.345447 OnlineSecurity
8 1.250707 OnlineBackup
9 1.318419 DeviceProtection
10 1.396276 TechSupport
11 1.513197 StreamingTV
12 1.488277 StreamingMovies
13 2.540526 Contract
14 1.181381 PaperlessBilling
15 1.172448 PaymentMethod
16 3.176709 MonthlyCharges
17 5.959530 TotalCharges

Independent feature VIF is within permissible limit of 10

Machine Learning Model Building¶

In [68]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score, confusion_matrix,classification_report
In [69]:
X_train, X_test, Y_train, Y_test = train_test_split(X_scale, Y, random_state=99, test_size=.3)
# print('Training feature matrix size:',X_train.shape)
# print('Training target vector size:',Y_train.shape)
# print('Test feature matrix size:',X_test.shape)
# print('Test target vector size:',Y_test.shape)

Finding best Random state¶

In [70]:
#from sklearn.linear_model import LogisticRegression
#from sklearn.tree import DecisionTreeClassifier
#from sklearn.metrics import accuracy_score, confusion_matrix,classification_report,f1_score
maxAccu=0
maxRS=0
for i in range(1,250):
    X_train,X_test,Y_train,Y_test = train_test_split(X_scale,Y,test_size = 0.3, random_state=i)
    lr=LogisticRegression()
    lr.fit(X_train,Y_train)
    y_pred=lr.predict(X_test)
    acc=accuracy_score(Y_test,y_pred)
    if acc>maxAccu:
        maxAccu=acc
        maxRS=i
print('Best accuracy is', maxAccu ,'on Random_state', maxRS)
Best accuracy is 0.8130372492836676 on Random_state 11
In [71]:
X_train, X_test, Y_train, Y_test = train_test_split(X_scale, Y, random_state=90, test_size=.3)
lr=LogisticRegression()
lr.fit(X_train,Y_train)
y_pred=lr.predict(X_test)
print(classification_report(Y_test, y_pred))
print()
print(confusion_matrix(Y_test, y_pred))
              precision    recall  f1-score   support

           0       0.85      0.76      0.80      1415
           1       0.78      0.86      0.82      1377

    accuracy                           0.81      2792
   macro avg       0.82      0.81      0.81      2792
weighted avg       0.82      0.81      0.81      2792


[[1078  337]
 [ 188 1189]]

Finding Optimal value of n_neighbors for KNN¶

In [72]:
from sklearn import neighbors
from math import sqrt
from sklearn.metrics import mean_squared_error
rmse_val = [] #to store rmse values for different k
for K in range(20):
    K = K+1
    model = neighbors.KNeighborsClassifier(n_neighbors = K)

    model.fit(X_train,Y_train)  #fit the model
    y_pred=model.predict(X_test) #make prediction on test set
    error = sqrt(mean_squared_error(Y_test,y_pred)) #calculate rmse
    rmse_val.append(error) #store rmse values
    print('RMSE value for k= ' , K , 'is:', error)
RMSE value for k=  1 is: 0.4791497962230918
RMSE value for k=  2 is: 0.49351090017950827
RMSE value for k=  3 is: 0.45735003491992954
RMSE value for k=  4 is: 0.4534174594676999
RMSE value for k=  5 is: 0.4565662296568125
RMSE value for k=  6 is: 0.45578107648827826
RMSE value for k=  7 is: 0.46047191440320734
RMSE value for k=  8 is: 0.45617382199543793
RMSE value for k=  9 is: 0.4577414342512293
RMSE value for k=  10 is: 0.46008283793795546
RMSE value for k=  11 is: 0.45617382199543793
RMSE value for k=  12 is: 0.4549945684363609
RMSE value for k=  13 is: 0.45420669846267875
RMSE value for k=  14 is: 0.4494504763141606
RMSE value for k=  15 is: 0.45420669846267875
RMSE value for k=  16 is: 0.4518348457054815
RMSE value for k=  17 is: 0.4514383253608233
RMSE value for k=  18 is: 0.45262684428999544
RMSE value for k=  19 is: 0.4502466691017817
RMSE value for k=  20 is: 0.45223101837756335
In [73]:
#plotting the rmse values against k values -
plt.figure(figsize = (8,6))
plt.plot(range(20), rmse_val, color='blue', linestyle='dashed', marker='o', markerfacecolor='green', markersize=10)
plt.show()
No description has been provided for this image

Comment-¶

At k=18, we get the minimum RMSE value which approximately 0.44059740636840716, and shoots up on further increasing the k value. We can safely say that k=18 will give us the best result in this case

Applying other classification algorithm¶

In [74]:
model=[ LogisticRegression(),
        SVC(),
        GaussianNB(),
        DecisionTreeClassifier(),
        KNeighborsClassifier(n_neighbors = 18),
        RandomForestClassifier(),
        ExtraTreesClassifier()]
        
for m in model:
    m.fit(X_train,Y_train)
    y_pred=m.predict(X_test)
    print('\033[1m'+'Classification ML Algorithm Evaluation Matrix',m,'is' +'\033[0m')
    print('\n')
    print('\033[1m'+'Accuracy Score :'+'\033[0m\n', accuracy_score(Y_test, y_pred))
    print('\n')
    print('\033[1m'+'Confusion matrix :'+'\033[0m \n',confusion_matrix(Y_test, y_pred))
    print('\n')
    print('\033[1m'+'Classification Report :'+'\033[0m \n',classification_report(Y_test, y_pred))
    print('\n')
    print('============================================================================================================')
Classification ML Algorithm Evaluation Matrix LogisticRegression() is


Accuracy Score :
 0.8119627507163324


Confusion matrix : 
 [[1078  337]
 [ 188 1189]]


Classification Report : 
               precision    recall  f1-score   support

           0       0.85      0.76      0.80      1415
           1       0.78      0.86      0.82      1377

    accuracy                           0.81      2792
   macro avg       0.82      0.81      0.81      2792
weighted avg       0.82      0.81      0.81      2792



============================================================================================================
Classification ML Algorithm Evaluation Matrix SVC() is


Accuracy Score :
 0.8330945558739254


Confusion matrix : 
 [[1117  298]
 [ 168 1209]]


Classification Report : 
               precision    recall  f1-score   support

           0       0.87      0.79      0.83      1415
           1       0.80      0.88      0.84      1377

    accuracy                           0.83      2792
   macro avg       0.84      0.83      0.83      2792
weighted avg       0.84      0.83      0.83      2792



============================================================================================================
Classification ML Algorithm Evaluation Matrix GaussianNB() is


Accuracy Score :
 0.794054441260745


Confusion matrix : 
 [[1075  340]
 [ 235 1142]]


Classification Report : 
               precision    recall  f1-score   support

           0       0.82      0.76      0.79      1415
           1       0.77      0.83      0.80      1377

    accuracy                           0.79      2792
   macro avg       0.80      0.79      0.79      2792
weighted avg       0.80      0.79      0.79      2792



============================================================================================================
Classification ML Algorithm Evaluation Matrix DecisionTreeClassifier() is


Accuracy Score :
 0.7872492836676218


Confusion matrix : 
 [[1089  326]
 [ 268 1109]]


Classification Report : 
               precision    recall  f1-score   support

           0       0.80      0.77      0.79      1415
           1       0.77      0.81      0.79      1377

    accuracy                           0.79      2792
   macro avg       0.79      0.79      0.79      2792
weighted avg       0.79      0.79      0.79      2792



============================================================================================================
Classification ML Algorithm Evaluation Matrix KNeighborsClassifier(n_neighbors=18) is


Accuracy Score :
 0.7951289398280802


Confusion matrix : 
 [[1018  397]
 [ 175 1202]]


Classification Report : 
               precision    recall  f1-score   support

           0       0.85      0.72      0.78      1415
           1       0.75      0.87      0.81      1377

    accuracy                           0.80      2792
   macro avg       0.80      0.80      0.79      2792
weighted avg       0.80      0.80      0.79      2792



============================================================================================================
Classification ML Algorithm Evaluation Matrix RandomForestClassifier() is


Accuracy Score :
 0.8535100286532952


Confusion matrix : 
 [[1197  218]
 [ 191 1186]]


Classification Report : 
               precision    recall  f1-score   support

           0       0.86      0.85      0.85      1415
           1       0.84      0.86      0.85      1377

    accuracy                           0.85      2792
   macro avg       0.85      0.85      0.85      2792
weighted avg       0.85      0.85      0.85      2792



============================================================================================================
Classification ML Algorithm Evaluation Matrix ExtraTreesClassifier() is


Accuracy Score :
 0.8384670487106017


Confusion matrix : 
 [[1187  228]
 [ 223 1154]]


Classification Report : 
               precision    recall  f1-score   support

           0       0.84      0.84      0.84      1415
           1       0.84      0.84      0.84      1377

    accuracy                           0.84      2792
   macro avg       0.84      0.84      0.84      2792
weighted avg       0.84      0.84      0.84      2792



============================================================================================================

CrossValidation :¶

In [75]:
from sklearn.model_selection import cross_val_score
model=[LogisticRegression(),
        SVC(),
        GaussianNB(),
        DecisionTreeClassifier(),
        KNeighborsClassifier(n_neighbors = 18),
        RandomForestClassifier(),
        ExtraTreesClassifier()]

for m in model:
    score = cross_val_score(m, X, Y, cv =5)
    print('\n')
    print('\033[1m'+'Cross Validation Score', m, ':'+'\033[0m\n')
    print("Score :" ,score)
    print("Mean Score :",score.mean())
    print("Std deviation :",score.std())
    print('\n')
    print('============================================================================================================')
    

Cross Validation Score LogisticRegression() :

Score : [0.73777539 0.7485223  0.80763031 0.80763031 0.8155914 ]
Mean Score : 0.7834299399675282
Std deviation : 0.033192035622825765


============================================================================================================


Cross Validation Score SVC() :

Score : [0.75067168 0.75013434 0.76786674 0.76732939 0.76451613]
Mean Score : 0.7601036556828621
Std deviation : 0.008003700479845513


============================================================================================================


Cross Validation Score GaussianNB() :

Score : [0.74314884 0.75174637 0.78828587 0.80601827 0.80860215]
Mean Score : 0.7795603011446037
Std deviation : 0.027272687420731516


============================================================================================================


Cross Validation Score DecisionTreeClassifier() :

Score : [0.71843095 0.75174637 0.83127351 0.83073616 0.82258065]
Mean Score : 0.7909535282799743
Std deviation : 0.04691558087085482


============================================================================================================


Cross Validation Score KNeighborsClassifier(n_neighbors=18) :

Score : [0.75980656 0.77485223 0.79043525 0.79742074 0.79569892]
Mean Score : 0.783642740346559
Std deviation : 0.014330107182171665


============================================================================================================


Cross Validation Score RandomForestClassifier() :

Score : [0.77431488 0.80010747 0.88285868 0.88823213 0.89623656]
Mean Score : 0.8483499448209715
Std deviation : 0.05076040993695079


============================================================================================================


Cross Validation Score ExtraTreesClassifier() :

Score : [0.75389575 0.77700161 0.88393337 0.8866201  0.88924731]
Mean Score : 0.8381396289427006
Std deviation : 0.059823582410026624


============================================================================================================

Hyper Parameter Tuning : GridSearchCV¶

In [76]:
from sklearn.model_selection import GridSearchCV
In [77]:
parameter = {  'max_depth': [5, 10,20,40,50,60], 
              'criterion':['gini','entropy']}
In [78]:
GCV = GridSearchCV(DecisionTreeClassifier(),parameter)
GCV.fit(X_train,Y_train)
Out[78]:
GridSearchCV(estimator=DecisionTreeClassifier(),
             param_grid={'criterion': ['gini', 'entropy'],
                         'max_depth': [5, 10, 20, 40, 50, 60]})
In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.
GridSearchCV(estimator=DecisionTreeClassifier(),
             param_grid={'criterion': ['gini', 'entropy'],
                         'max_depth': [5, 10, 20, 40, 50, 60]})
DecisionTreeClassifier(max_depth=10)
DecisionTreeClassifier(max_depth=10)
In [79]:
GCV.best_params_
Out[79]:
{'criterion': 'gini', 'max_depth': 10}
In [80]:
dtc1=DecisionTreeClassifier(max_depth=40,criterion="entropy")
In [81]:
dtc1.fit(X_train,Y_train)
y_pred=dtc1.predict(X_test)
print(classification_report(y_pred,Y_test))
              precision    recall  f1-score   support

           0       0.78      0.80      0.79      1382
           1       0.80      0.78      0.79      1410

    accuracy                           0.79      2792
   macro avg       0.79      0.79      0.79      2792
weighted avg       0.79      0.79      0.79      2792

Final Model¶

In [82]:
Final_mod = RandomForestClassifier(bootstrap=True,criterion='entropy',n_estimators= 60, max_depth=10 ,max_features='sqrt')
Final_mod.fit(X_train,Y_train)
y_pred=Final_mod.predict(X_test)
print('\033[1m'+'Accuracy Score :'+'\033[0m\n', accuracy_score(Y_test, y_pred))
Accuracy Score :
 0.8531518624641834
In [83]:
# Lets plot confusion matrix for  FinalModel
Matrix = confusion_matrix(Y_test, y_pred)

x_labels = ["NO","YES"]
y_labels = ["NO","YES"]

fig , ax = plt.subplots(figsize=(5,5))
sns.heatmap(Matrix, annot = True,linewidths=.2, linecolor="black", fmt = ".0f", ax=ax, 
            cmap="plasma", xticklabels = x_labels, yticklabels = y_labels)

plt.xlabel("Predicted Label",fontsize=14,fontweight='bold')
plt.ylabel("True Label",fontsize=14,fontweight='bold')
plt.title('Confusion Matrix for Final Model',fontsize=20,fontweight='bold')
plt.show()
No description has been provided for this image
In [84]:
from sklearn.metrics import roc_auc_score
from sklearn.metrics import roc_curve

from sklearn.metrics import RocCurveDisplay

RocCurveDisplay.from_estimator(Final_mod, X_test, Y_test)   
plt.legend(prop={'size':11}, loc='lower right')
plt.title('AOC ROC Curve of Final Model',fontsize=20,fontweight='bold')
plt.show()

auc_score = roc_auc_score(Y_test, Final_mod.predict(X_test))
print('\033[1m'+'Auc Score :'+'\033[0m\n',auc_score)
No description has been provided for this image
Auc Score :
 0.8536414749121739

Saving model¶

In [85]:
import joblib
joblib.dump(Final_mod,'Customer_Churn_Final.pkl')
Out[85]:
['Customer_Churn_Final.pkl']

Predicting the Final Model¶

In [86]:
# Prediction
prediction = Final_mod.predict(X_test)
In [87]:
Actual = np.array(Y_test)
df_Pred = pd.DataFrame()
df_Pred["Predicted Values"] = prediction
df_Pred["Actual Values"] = Actual
df_Pred.head()
Out[87]:
Predicted Values Actual Values
0 1 1
1 1 1
2 1 1
3 1 1
4 0 0
In [ ]: